The raw data from the Philadelphia Neighborhood Corpus is available here:
library(devtools)
install_github("jofrhwld/UhUm")
library(UhUm)
head(um_PNC, 3)
## idstring word start_time end_time vowel_start vowel_end nasal_start
## 1 PH00-1-1- UH 24.39 24.69 24.39 24.69 NA
## 2 PH00-1-1- UH 34.96 35.24 34.96 35.24 NA
## 3 PH00-1-1- UM 37.90 38.27 37.90 38.12 38.12
## nasal_end next_seg next_seg_start next_seg_end chunk_start chunk_end
## 1 NA S 24.69 24.87 24.39 25.29
## 2 NA F 35.24 35.35 34.96 37.11
## 3 38.27 sp 38.27 38.39 37.90 38.80
## nwords sex year age ethnicity schooling transcribed total nvowels
## 1 6551 m 2000 21 i/r 14 2811 2814 3078
## 2 6551 m 2000 21 i/r 14 2811 2814 3078
## 3 6551 m 2000 21 i/r 14 2811 2814 3078
library(dplyr)
library(tidyr)
# count tokens of each filled-pause type by speaker sex
um_PNC %>%
  group_by(word, sex) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  spread(sex, n)
## Source: local data frame [5 x 3]
##
## word f m
## 1 AND_UH 904 1176
## 2 AND_UM 314 153
## 3 UH 7523 9520
## 4 UM 4132 1792
## 5 UM_UH 7 2
From the FAVE transcription guidelines:
Flat (or decreasing?)
Cohorts
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.27 | 0.09 | -2.90 | 0.00 |
| fol_segC | -2.03 | 0.09 | -23.02 | 0.00 |
| fol_segV | -1.77 | 0.13 | -13.60 | 0.00 |
| decade | 0.50 | 0.04 | 12.78 | 0.00 |
| fol_segC:decade | 0.23 | 0.04 | 5.54 | 0.00 |
| fol_segV:decade | 0.20 | 0.06 | 3.18 | 0.00 |
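The estimates above are on the log-odds scale, so it can help to convert a few of them to probabilities. The following is a minimal sketch in base R under stated assumptions: `fol_seg` is taken to be treatment-coded with an unnamed reference level, the scale and centering of `decade` are not given in the slides, and the helper `log_odds()` is hypothetical, not part of the original analysis.

```r
# Fixed-effect estimates copied from the table above (log-odds scale).
coefs <- c("(Intercept)" = -0.27, "fol_segC" = -2.03, "fol_segV" = -1.77,
           "decade" = 0.50, "fol_segC:decade" = 0.23, "fol_segV:decade" = 0.20)

# Linear predictor for one following-segment level and one decade value.
# Assumption: "ref" stands for the (unnamed) reference level of fol_seg,
# and `decade` is on whatever scale the fitted model used.
log_odds <- function(fol_seg = c("ref", "C", "V"), decade = 0) {
  fol_seg <- match.arg(fol_seg)
  eta <- coefs[["(Intercept)"]] + coefs[["decade"]] * decade
  if (fol_seg != "ref") {
    eta <- eta +
      coefs[[paste0("fol_seg", fol_seg)]] +
      coefs[[paste0("fol_seg", fol_seg, ":decade")]] * decade
  }
  eta
}

# plogis() is the inverse logit: log-odds -> probability.
p_ref <- plogis(log_odds("ref", decade = 0))
p_C   <- plogis(log_odds("C", decade = 0))
print(round(c(reference = p_ref, C = p_C), 3))
```

Because the interaction terms are positive, the gap between the reference level and the other two levels narrows on the log-odds scale as `decade` increases.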
| | Df | AIC | BIC | logLik | deviance | Chisq | Chi Df | Pr(>Chisq) |
|---|---|---|---|---|---|---|---|---|
| cre_mod2 | 5 | 9893 | 9929 | -4941 | 9883 | NA | NA | NA |
| cre_mod | 7 | 9856 | 9907 | -4921 | 9842 | 40.75 | 2 | 0 |
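The comparison table can be checked by hand. This sketch assumes the two models are nested, that `Df` counts estimated parameters, and that the reported deviance equals -2 × logLik (which the rounded values are consistent with). The variable names are illustrative only.

```r
# Values copied from the model-comparison table above.
deviance_reduced <- 9883   # cre_mod2, 5 parameters
deviance_full    <- 9842   # cre_mod,  7 parameters
df_diff <- 7 - 5

# With deviance = -2 * logLik, AIC = deviance + 2 * (number of parameters).
aic_reduced <- deviance_reduced + 2 * 5
aic_full    <- deviance_full    + 2 * 7

# The table reports Chisq = 40.75; the rounded deviances differ by 41.
chisq   <- 40.75
p_value <- pchisq(chisq, df = df_diff, lower.tail = FALSE)

print(c(aic_reduced = aic_reduced, aic_full = aic_full))
print(p_value)
```

The p-value is far below any conventional threshold, which is why the table prints it as 0.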
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.72 | 0.10 | -7.56 | 0.00 |
| decade | 0.59 | 0.04 | 14.00 | 0.00 |
| log2dur | 0.30 | 0.01 | 20.56 | 0.00 |
| decade:log2dur | -0.03 | 0.01 | -4.71 | 0.00 |
## is_um ~ decade * log2dur + (1 | idstring)
\[\text{fp}\rightarrow\text{ə}\left<\begin{array}{c}\text{m}\\\emptyset\end{array}\right>\]
Persistence, from Tamminga (2014)
Linguists have written about:
Someone else has written about(?):
\(P(\text{ukip}~|~\text{um}) = \mathcal{M}(p,q) \approx 1\)
Hang 1 lamp if the British are coming by land, 2 if by sea.
The amount of information to be communicated depends on how likely the different outcomes are:
| by_land | by_sea | entropy |
|---|---|---|
| 0.1 | 0.9 | 0.47 |
| 0.2 | 0.8 | 0.72 |
| 0.5 | 0.5 | 1.00 |
| 0.8 | 0.2 | 0.72 |
| 0.9 | 0.1 | 0.47 |
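The entropy column can be reproduced in a few lines of base R. This is a minimal sketch: `entropy_bits()` is a hypothetical helper name, not necessarily the `entropy()` function used elsewhere in these slides, and it assumes its input is a probability vector that sums to 1.

```r
# Shannon entropy, in bits, of a discrete probability vector.
# Hypothetical helper; zero-probability outcomes contribute nothing.
entropy_bits <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

by_land <- c(0.1, 0.2, 0.5, 0.8, 0.9)
entropy_tab <- data.frame(
  by_land = by_land,
  by_sea  = 1 - by_land,
  entropy = sapply(by_land, function(x) entropy_bits(c(x, 1 - x)))
)
print(round(entropy_tab, 2))
```

Entropy peaks at 1 bit when the two outcomes are equally likely and falls off symmetrically as one outcome becomes more predictable.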
The quality of the signal depends on how strictly it covaries with the message:
The Joint Distribution
| | by land | by sea | margin |
|---|---|---|---|
| 1 lamp | 0.64 | 0.04 | 0.68 |
| 2 lamps | 0.16 | 0.16 | 0.32 |
| margin | 0.8 | 0.2 | 1 |
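To see how strictly the signal covaries with the message in this table, the joint distribution can be turned into margins and conditional probabilities. This is an illustrative base R sketch; the object names are not from the original slides.

```r
# Joint distribution of signal (rows) and message (columns), from the table above.
joint <- matrix(c(0.64, 0.04,
                  0.16, 0.16),
                nrow = 2, byrow = TRUE,
                dimnames = list(signal  = c("1 lamp", "2 lamps"),
                                message = c("by land", "by sea")))

signal_margin  <- rowSums(joint)   # P(signal)
message_margin <- colSums(joint)   # P(message)

# P(signal | message): each column divided by its column total.
signal_given_message <- prop.table(joint, margin = 2)
print(signal_given_message)
```

Under this joint distribution the correct number of lamps is hung 80% of the time for either message, so the signal is informative but noisy.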
The Mutual Information between message and signal:
entropy(c(0.2, 0.8)) + # message uncertainty
entropy(c(0.68, 0.32)) - # signal uncertainty
entropy(c(0.64, 0.04, # joint uncertainty
0.16, 0.16))
## [1] 0.1825
# bits that could've been communicated
# with a perfect signal
entropy(c(0.2, 0.8))
## [1] 0.7219
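The same calculation can be wrapped in a function that takes the joint distribution directly, which makes it easier to compare against other signals. This is a sketch under the assumption that the input is a matrix of joint probabilities summing to 1; `entropy_bits()` and `mutual_information()` are hypothetical helper names, not the functions used to produce the output above.

```r
# Shannon entropy in bits; zero cells are dropped.
entropy_bits <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# I(message; signal) = H(message) + H(signal) - H(message, signal)
mutual_information <- function(joint) {
  entropy_bits(colSums(joint)) +
    entropy_bits(rowSums(joint)) -
    entropy_bits(as.vector(joint))
}

# Rows are signals (1 lamp, 2 lamps); columns are messages (by land, by sea).
joint <- matrix(c(0.64, 0.04,
                  0.16, 0.16),
                nrow = 2, byrow = TRUE)

mi        <- mutual_information(joint)
h_message <- entropy_bits(colSums(joint))
print(round(c(mi = mi, h_message = h_message, share = mi / h_message), 4))
```

The ratio `mi / h_message` expresses the signal's quality as a share of the message uncertainty it resolves: here, about a quarter of what a perfect signal would convey.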
Need to compare this to some other kind of signal.
library("babynames")
head(babynames, 3)
## Source: local data frame [3 x 5]
##
## year sex name n prop
## 1 1880 F Mary 7065 0.07238
## 2 1880 F Anna 2604 0.02668
## 3 1880 F Emma 2003 0.02052
tail(babynames, 3)
## Source: local data frame [3 x 5]
##
## year sex name n prop
## 1 2013 M Zymari 5 2.499e-06
## 2 2013 M Zymeer 5 2.499e-06
## 3 2013 M Zyree 5 2.499e-06
One-off update: